Abstract

We describe an approach to parallel compilation that seeks to harness the vast amount 
of fine-grain parallelism that is exposed through partial evaluation of numerically- 
intensive scientific programs. We have constructed a compiler for the Supercomputer- 
Toolkit parallel processor that uses partial evaluation to break down data abstractions 
and program structure, producing huge basic blocks that contain large amounts of 
fine-grain parallelism. We show that this fine-grain parallelism can be effectively uti- 
lized even on coarse-grain parallel architectures by selectively grouping operations 
together so as to adjust the parallelism grain-size to match the inter-processor com- 
munication capabilities of the target architecture. 
1 Introduction

One of the major obstacles to compiling parallel pro- 
grams is the question of how to automatically identify 
and exploit the underlying parallelism inherent in a pro- 
gram. We have implemented a compiler for parallel 
programs that uses novel techniques to detect and ef- 
fectively utilize the fine-grained parallelism that is in- 
herent in many numerically-intensive scientific computa- 
tions. Our approach differs from the current fashion in 
parallel compilation, in that rather than relying on the 
structure of the program to detect locality and paral- 
lelism, we use partial evaluation[5] to remove most loops 
and high-level data structure manipulations, producing 
a low-level program that exposes all of the parallelism 
inherent in the underlying numerical computation. We 
then use an operation-aggregating technique to increase
the grain-size of this parallelism to match the communi-
cation characteristics of the target parallel architecture.
This approach, which was used to implement the com-
piler for the Supercomputer Toolkit parallel computer [1],
has proven highly effective for an important class of 
numerically-oriented scientific problems. 

Our approach to compilation is specifically tailored 
to produce efficient statically scheduled code for par- 
allel architectures which suffer from serious inter- 
processor communication latency and bandwidth limi- 
tations. For instance, on the eight processor Supercom- 
puter Toolkit system in operation at M.I.T., six cycles 1 
are required before a value computed by one processor is 
available for use by another, while bandwidth limitations 
allow only one value out of every eight values produced 
to be transmitted among the processors. Despite these 
limitations, code produced by our compiler for an im- 
portant astrophysics application 2 runs 6.2 times faster 
on our eight-processor system than does near-optimal 
code produced for a uniprocessor system. 3 

Interprocessor communication latency and bandwidth 
limitations pose severe obstacles to the effective use of 
multiple processors. High communication latency re- 
quires that there be enough parallelism available to al- 
low each processor to continue to initiate operations even 
while waiting for results produced elsewhere to arrive. 4 
Limited communication bandwidth severely restricts the
parallelism grain-size that may be utilized by requiring
that most values used by a processor be produced on that
processor, rather than being received from another
processor. We overcome these obstacles by combining
partial evaluation, which exposes large amounts of
extremely fine-grained parallelism, with an
operation-aggregating technique that increases the
grain-size of the operations being scheduled for parallel
execution to match the communication capabilities of the
target architecture.

1 A "cycle" corresponds to the time required to perform a
floating-point multiplication or addition operation.

2 Stormer integration of the 9-body gravitational
attraction problem.

3 The code produced for the uniprocessor was also partially
evaluated, to ensure that the factor of 6.2 speedup is
entirely due to parallel execution.

4 [5] (page 35) describes how the effect that interprocessor
communication latency has on available parallelism is
similar to that of increasing the length of an individual
processor's pipeline. In order to continue to initiate
instructions on a heavily pipelined processor, there must be
operations available that do not depend on results that have
not yet emerged from the processor pipeline. Similarly, in
order to continue to initiate instructions on a parallel
machine that suffers from high communication latency, there
must be operations available that do not depend on results
that have not yet been received.

2 Our Approach 

We use partial evaluation to eliminate the barriers to 
parallel execution imposed by the data representations 
and control structure of a high-level program. Par- 
tial evaluation is particularly effective on numerically- 
oriented scientific programs since these programs tend 
to be mostly data-independent, meaning that they con- 
tain large regions in which the operations to be per- 
formed do not depend on the numerical values of the 
data being manipulated. 5 As a result of this data- 
independence, partial evaluation is able to perform in 
advance, at compile time, most data structure refer- 
ences, procedure calls, and conditional branches related 
to data structure size, leaving only the underlying nu- 
merical computations to be performed at run time. The 
underlying numerical computations form huge sequences 
of purely numerical code, known as basic blocks. Of- 
ten, these basic blocks contain several thousand instruc- 
tions. The order in which basic blocks are invoked is 
determined by data-dependent conditional branches and 
looping constructs. 

We schedule the partially-evaluated program for par- 
allel execution primarily by performing the operations 
within an individual basic block in parallel. This is prac- 
tical only because the basic blocks produced by partial 
evaluation are so large. Were it not for partial evalua- 
tion, the basic blocks would be two orders of magnitude 
smaller, requiring the use of techniques such as software 
pipelining and trace scheduling, that seek to overlap the 
execution of multiple basic blocks. Executing a huge ba- 
sic block in parallel is very attractive since it is clear 
at compile time which operations need to be performed, 
which results they depend on, and how much computa- 
tion each instruction will require, ensuring the effective- 
ness of static scheduling techniques. In contrast, par- 
allelizing a program by executing multiple basic blocks 
simultaneously requires guessing the direction that con- 
ditional branches will take, how many times a particular 
basic block may be executed, and how large the data 
structures will be. 

Our approach of combining partial evaluation with
parallelism grain size selection was used to implement
the compiler for the Supercomputer Toolkit parallel
processor [1]. 6 The Toolkit compiler operates in four
major phases, as shown in Figure 1. The first phase
performs partial evaluation, followed by traditional
compiler optimizations, such as constant folding and
dead-code elimination. The second phase analyzes locality
constraints within each basic block, locating operations
that depend so closely on one another that it is clearly
desirable that they be computed on the same processor.
Closely related operations are grouped together to form
a higher grain-size instruction, known as a region. The
third compilation phase uses heuristic scheduling
techniques to assign each region to a processor. The final
phase schedules the individual operations for execution
within each processor, accounting for pipelining, memory
access restrictions, register allocation, and final
allocation of the inter-processor communication pathways.

Figure 1: Four phase compilation process that produces
parallel object code from Scheme source code.

5 For instance, matrix multiplication performs the same set
of operations, regardless of the particular numerical values
of the matrix elements.

6 See Appendix for a brief overview of the architecture of
the Supercomputer Toolkit parallel processor.

3 The Partial Evaluator 

Partial evaluation converts a high-level, abstractly writ- 
ten, general purpose program into a low-level program 
that is specialized for the particular application at hand. 
For instance, a program that computes force interactions 
among a system of N particles might be specialized to 
compute the gravitational interactions among 5 plan- 
ets of our particular solar system. This specialization 
is achieved by performing in advance, at compile time, 
all operations that do not depend explicitly on the ac- 
tual numerical values of the data. Many data structure 
references, procedure calls, conditional branches, table 
lookups, loop iterations, and even some numerical oper- 
ations may be performed in advance, at compile time, 
leaving only the underlying numerical operations to be 
performed at run time. 

The Toolkit compiler performs partial evaluation us- 
ing the symbolic execution technique described in [4]. 
The partial evaluator takes as input the program to be 
compiled, as well as the input data structures associated 
with a particular application. Some numerical values 
within the input data structures will not be available at 
compile time; these missing numerical values are
represented by a data structure known as a placeholder.
The data-independent portions of the program are executed
symbolically at compile time, allowing all operations that
do not depend on missing numerical values to be performed
in advance, leaving only the lowest-level numerical
operations to be performed at runtime. This process is
illustrated in Figure 2, which shows the result of
partially evaluating a simple sum-of-squares program.

HIGH-LEVEL PROGRAM:

(define (square x) (* x x))

(define (sum-of-squares L)
  (apply + (map square L)))

(define input-data
  (list
    (make-placeholder 'floating-point)   ;; placeholder #1
    (make-placeholder 'floating-point)   ;; placeholder #2
    3.14))

(partial-evaluate (sum-of-squares input-data))

PARTIALLY-EVALUATED PROGRAM:

(INPUT 1)   ;; numerical value for placeholder #1
(INPUT 2)   ;; numerical value for placeholder #2
(ASSIGN 3 (floating-point-multiply (FETCH 1) (FETCH 1)))
(ASSIGN 4 (floating-point-multiply (FETCH 2) (FETCH 2)))
(ASSIGN 5 (floating-point-add (FETCH 3) (FETCH 4) 9.8596))
(RESULT 5)

Figure 2: Partial evaluation of the sum-of-squares program,
for an application where the input is known to be a list of
three floating-point numbers, the last of which is always
3.14. Notice how the squaring of 3.14 to produce 9.8596 took
place at compile time, and how all list-manipulation
operations have been eliminated.

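The placeholder idea can be mimicked in a few lines of Python. The following is only a minimal sketch of symbolic execution over placeholders, not the Toolkit compiler's implementation; the names Placeholder, Emitter, mul, and add are hypothetical and chosen for illustration. Operations whose operands are all known are folded at compile time, while operations that touch a placeholder are emitted into a residual program in the style of Figure 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placeholder:
    """Stands for a numerical value that is unknown at compile time."""
    slot: int


class Emitter:
    """Collects the residual (run-time) instructions. Hypothetical helper."""

    def __init__(self):
        self.code = []
        self.next_slot = 1

    def new_input(self):
        p = Placeholder(self.next_slot)
        self.next_slot += 1
        self.code.append(("INPUT", p.slot))
        return p

    def emit(self, op, *args):
        # Known numbers stay literal; placeholders become FETCHes.
        operands = tuple(("FETCH", a.slot) if isinstance(a, Placeholder) else a
                         for a in args)
        p = Placeholder(self.next_slot)
        self.next_slot += 1
        self.code.append(("ASSIGN", p.slot, (op,) + operands))
        return p


def mul(em, x, y):
    if isinstance(x, Placeholder) or isinstance(y, Placeholder):
        return em.emit("floating-point-multiply", x, y)
    return x * y  # both operands known: fold at compile time


def add(em, *xs):
    unknown = [x for x in xs if isinstance(x, Placeholder)]
    known = [x for x in xs if not isinstance(x, Placeholder)]
    if not unknown:
        return sum(known)
    folded = [sum(known)] if known else []
    return em.emit("floating-point-add", *unknown, *folded)


def sum_of_squares(em, values):
    # The list traversal runs at compile time; only arithmetic on
    # placeholders survives into the residual program.
    return add(em, *[mul(em, v, v) for v in values])


em = Emitter()
data = [em.new_input(), em.new_input(), 3.14]
result = sum_of_squares(em, data)
em.code.append(("RESULT", result.slot))
for instr in em.code:
    print(instr)
```

The residual program has the same six-instruction shape as the partially-evaluated program of Figure 2: two inputs, two multiplies, one add carrying a pre-computed constant close to 9.8596, and a result.
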
Although partial evaluation is highly effective on the 
data-independent portions of a program, data-dependent 
conditional branches pose a serious obstacle. Data- 
dependent conditional branches interrupt the flow of 
compile time execution, since it will not be known until 
runtime which branch of the conditional should be exe- 
cuted. Fortunately, most numerical programs consist of 
large sequences of data-independent code, separated by 
occasional data-dependent conditional branches. 7 We
partially evaluate each data-independent segment of a
program, leaving intact the data-dependent branches
that glue the data-independent segments together. 8 In
this way, each data-independent program segment is
converted into a sequence of purely numerical operations,
forming a huge basic block that contains a large amount
of fine-grain parallelism.

7 Some typical uses of data-dependent branches in scientific
programs are to check for convergence, or to examine the
accumulated error when varying the step-size of a numerical
integrator. These uses usually occur after a long sequence of
data-independent code. Indeed, the only significant exception
to this usage pattern that we have encountered is when a
matrix solver examines the numerical values of the matrix
elements in order to choose the best elements to use as
pivots. [3] describes additional techniques for partially
evaluating data-dependent branches, such as generating
different compiled code for each possible branch direction,
and then choosing at run-time which set of code to execute.
Although techniques of this sort cannot overcome large-scale
control flow changes, they have proven quite effective for
dealing with localized branches such as those associated with
the selection operators MIN, MAX, and ABS, as well as with
piecewise defined equations.

8 The partial-evaluation phase of our compiler is currently
not very well automated, requiring that the programmer
provide the compiler with a set of input data structures for
each data-independent code sequence, as if the
data-independent sequences were separate programs being glued
together by the data-dependent conditional branches. This
manual interface to the partial evaluator is somewhat of an
implementation quirk; there is no reason that it could not be
more automated. Indeed, several Supercomputer Toolkit users
have built code generation systems on top of our compiler
that automatically generate complete programs, including
data-dependent conditionals, invoking the partial evaluator
to optimize the data-independent portions of the program.
Recent work by Weise, Ruf, and Katz [19, 20, 13] describes
additional techniques for automating the partial-evaluation
process across data-dependent branches.

4 Exposing Fine-Grain Parallelism

Each basic block produced by partial evaluation may
be represented as a data-independent (static) data-flow
graph whose operators are all low-level numerical
operations. Previous work has shown that this graph contains
large amounts of low-level parallelism. For instance, as
illustrated in Figure 3, a parallelism profile analysis of
the 9-body gravitational attraction problem 9 indicates
that partial evaluation exposed so much low-level
parallelism that in theory, parallel execution could speed
up the computation by a factor of 69 relative to
uniprocessor execution.

Achieving the theoretical speedup factor of 69 for the
9-body problem would require using 516 non-pipelined
processors capable of instantaneous communication with
one another. In practice, much of the available
parallelism must be used to keep processor pipelines full,
and it does take time (latency) to communicate between
processors. As the latency of inter-processor communication
increases, the maximum possible speedup decreases, as
some of the parallelism must be used to keep each
processor busy while awaiting the arrival of results from
neighboring processors. Communication bandwidth
limitations further restrict how parallelism may be used by
requiring that most values used by a processor actually
be produced by that processor.

Figure 3: Parallelism profile of the 9-body problem. This
graph represents all of the parallelism available in the
problem, taking into account the varying latency of
numerical operations.

9 Specifically, one time-step of a 12th-order Stormer
integration of the gravity-induced motion of a 9-body solar
system.
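To make the notion of a parallelism profile concrete, the sketch below computes one for a small data-flow graph under the same idealized assumptions as the theoretical figure above: unlimited non-pipelined processors and instantaneous communication. Each operation starts as soon as its operands are finished, the profile counts the operations active in each cycle, and the ideal speedup is total work divided by critical-path length. The function name and the dictionary-based graph encoding are assumptions made for this illustration, not taken from the compiler.

```python
from collections import Counter


def parallelism_profile(latency, deps):
    """latency: {op: cycles}, listed in topological order (producers first).
    deps: {op: [ops whose results it consumes]}.
    Returns (profile, critical_path_length, ideal_speedup)."""
    finish = {}
    profile = Counter()
    for op, cycles in latency.items():
        # Start as early as possible: when the last operand is ready.
        start = max((finish[d] for d in deps.get(op, [])), default=0)
        finish[op] = start + cycles
        for t in range(start, finish[op]):
            profile[t] += 1  # op is active during cycle t
    critical_path = max(finish.values())
    total_work = sum(latency.values())
    return profile, critical_path, total_work / critical_path


# Four independent multiplies feeding a small tree of adds.
latency = {"m1": 1, "m2": 1, "m3": 1, "m4": 1, "a1": 1, "a2": 1, "a3": 1}
deps = {"a1": ["m1", "m2"], "a2": ["m3", "m4"], "a3": ["a1", "a2"]}
profile, critical_path, ideal_speedup = parallelism_profile(latency, deps)
print(dict(profile), critical_path, ideal_speedup)
```

In this toy graph four operations can run in the first cycle, two in the second, and one in the third, so seven cycles of work fit into a critical path of three. Pipelining, communication latency, and bandwidth limits all reduce the speedup actually attainable below this ideal ratio.
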



5 Grain Size vs. Bandwidth 

We have found that bandwidth limitations make it im- 
practical to use critical path based scheduling techniques 
to spread fine-grain parallelism across multiple proces- 
sors. In the latency-limited case investigated by Berlin 
and Weise [5], it is feasible to schedule a fine-grain op- 
eration for parallel execution whenever there is suffi- 
cient time for the operands to arrive at the processor 
doing the computing, and for the result to be trans- 
mitted to its consumers. Hence it is practical to assign 
non-critical-path operations to any available processor. 
Bandwidth limitations destroy this option by limiting 
the number of values that may be transmitted between 
processors, thereby forcing operations that could oth- 
erwise have been computed elsewhere to be computed 
on the processor that is the ultimate consumer of their 
results. Indeed, on the Supercomputer Toolkit archi- 
tecture, which suffers from both latency and bandwidth 
limitations, heuristic techniques similar to those used by 
Berlin and Weise achieved a dismal speedup factor of 
only 2.5 using 8 processors. One possible solution to the
bandwidth problem is to modify the critical-path based
scheduling approach to make a much more careful and
computationally-expensive decision regarding which
results may be transmitted between processors, and which
processor a particular result should be computed in.
Although this modification could be achieved by adding a
backtracking heuristic that searched for different ways
of assigning each fine-grain instruction to a processor, 10
this optimization based approach seems computationally
prohibitive for use on the huge basic blocks produced by
partial evaluation.

10 Indeed, one possibility would be to design the
backtracking heuristic based on a simulated annealing search
of the scheduling configuration space.

6 Adjusting the Grain Size

Rather than extending the critical-path based approach
to handle bandwidth limitations by searching for a
globally acceptable fine-grain scheduling solution, we seek
to hide the bandwidth limitation by increasing the
grain-size of the operations being scheduled. Prior to
initiating critical-path based scheduling, we perform a
locality analysis that groups together operations that
depend so closely on one another that it would not be
practical to place them in different processors. Each group
of closely interdependent operations forms a larger
grain-size instruction, which we refer to as a region. 11
Some regions will be large, while others may be as small as
one fine-grain instruction. In essence, grouping operations
together to form a region is a way of simplifying
the scheduling process by deciding in advance that certain
opportunities for parallel execution will be ignored
due to limited communication capabilities.

Since all operations within a region are guaranteed to
be scheduled onto the same processor, the maximum region
size must be chosen to match the communication
capabilities of the target architecture. For instance, if
regions are permitted to grow too large, a single region
might encompass the entire data-flow graph, forcing the
entire computation to be performed on a single processor!
Although strict limits are therefore placed on the
maximum size of a region, regions need not be of uniform
size. Indeed, some regions will be large, corresponding
to localized computation of intermediate results, while
other regions will be quite small, corresponding to results
that are used globally throughout the computation.

We have experimented with several different heuristics
for grouping operations into regions. The optimal strategy
for grouping instructions into regions varies with the
application and with the communication limitations of
the target architecture. However, we have found that
even a relatively simple grain-size adjustment strategy
dramatically improves the performance of the scheduling
process. For instance, as illustrated in Figure 4, when a
value is used by only one instruction, the producer and
consumer of that value may be grouped together to form
a region, thereby ensuring that the scheduler will not
place the producer and consumer on different processors
in an attempt to use spare cycles wherever they happen
to be available. Provided that the maximum region size
is chosen appropriately, 12 grouping operations
together based on locality prevents the scheduler from
making gratuitous use of the communication channels,
forcing it to focus on scheduling options that make more
effective use of the limited communication bandwidth.

An important aspect of grain-size adjustment is that
the grain-size is not increased uniformly. As shown in
Figure 5, some regions are much larger than others.
Indeed, it is important not to forcibly group non-localized
operations into regions simply to increase the grain-size.
For example, it is likely that the result produced by an
instruction that has many consumers will be transmitted
amongst the processors, since it would not be practical
to place all of the consumers on the result-producing
processor. In this case, creating a large region by grouping
together the producer with only some of the consumers
would increase the grain-size, but would not reduce
inter-processor communication, since the result would need
to be transmitted anyway. In other words, it only makes
sense to limit the scheduler's options by grouping
operations together when doing so will reduce
inter-processor communication.

Figure 4: A Simple Region Forming Heuristic. A
region is formed by grouping together operations that
have a simple producer/consumer relationship. This
process is invoked repeatedly, with the region growing in
size as additional producers are added. The region-growing
process terminates when no suitable producers remain,
or when the maximum region size is reached. A producer
is considered suitable to be included in a region if
it produces its result solely for use by that region. (The
numbers shown within each node reflect the computational
latency of the operation.)

Figure 5: The numerical operations in the 9-body program
were divided into regions based on locality. This
table shows how region size can vary depending on the
locality structure of the computation. Region size is
measured by computational latency (cycles). The program
was divided into 292 regions, with an average region size
of 7.56 cycles.

11 The name region was chosen because we think of the
grain-size adjustment technique as identifying "regions" of
locality within the data-flow graph. The process of grain-size
adjustment is closely related to the problem of graph
multisection, although our region-finder is somewhat more
particular about the properties (shape, size, and
connectivity) of each "region" sub-graph than are typical
graph multisection algorithms.

12 The region size must be chosen such that the computational
latency of the operations grouped together is well-matched to
the communication bandwidth limitations of the architecture.
If the regions are made too large, communication bandwidth
will be underutilized since the operations within a region do
not transmit their results.

7 Parallel Scheduling

Exploiting locality by grouping operations into regions
forces closely-related operations to occur on the same
processor. Although this reduces inter-processor
communication requirements, it also eliminates many
opportunities for parallel execution. Figure 6 shows the
parallelism remaining in the 9-body problem after
operations have been grouped into regions. Comparison with
Figure 3 shows that increasing the grain-size eliminated
about half of the opportunities for parallel execution.
The challenge facing the parallel scheduler is to make
effective use of the limited parallelism that remains, while
taking into consideration such factors as communication
latency, memory traffic, pipeline delays, and allocation
of resources such as processor buses and inter-processor
communication channels.
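The simple region-forming heuristic described in the caption of Figure 4 can be sketched as follows. This is an illustrative reading of that caption rather than the compiler's actual region-finder: the function name, the graph encoding, and the choice to seed regions from consumers and grow toward producers are assumptions. A producer joins a region only if it is unassigned, all of its consumers already lie in the region, and the region's total latency stays within max_size.

```python
def form_regions(latency, deps, max_size):
    """latency: {op: cycles}, listed in topological order (producers first).
    deps: {op: [producers whose results it consumes]}.
    Returns a list of regions, each a set of operation names."""
    consumers = {op: set() for op in latency}
    for op, producers in deps.items():
        for p in producers:
            consumers[p].add(op)

    assigned = set()
    regions = []
    # Visit consumers before producers so each region grows upward
    # from a seed toward the operations that feed it.
    for seed in reversed(list(latency)):
        if seed in assigned:
            continue
        region, size = {seed}, latency[seed]
        assigned.add(seed)
        grew = True
        while grew:
            grew = False
            for op in sorted(region):  # sorted only for determinism
                for p in deps.get(op, []):
                    suitable = (p not in assigned
                                and consumers[p] <= region  # used solely here
                                and size + latency[p] <= max_size)
                    if suitable:
                        region.add(p)
                        assigned.add(p)
                        size += latency[p]
                        grew = True
        regions.append(region)
    return regions


# g is used by three consumers; m1 and m2 are used only by a1.
latency = {"g": 1, "m1": 1, "m2": 1, "a1": 1, "b": 1}
deps = {"m1": ["g"], "m2": ["g"], "a1": ["m1", "m2"], "b": ["g"]}
regions_large = form_regions(latency, deps, max_size=4)
regions_small = form_regions(latency, deps, max_size=2)
print(regions_large)
print(regions_small)
```

With max_size=4 the single-consumer producers m1 and m2 are absorbed into the region of a1, while the widely used value g remains a one-operation region whose result will be transmitted. Lowering max_size to 2 leaves room for only one of the two producers, so m2 becomes its own region.
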

The Supercomputer Toolkit compiler schedules oper- 
ations for parallel execution in two phases. The first 
phase, known as the region-level scheduler, is primar- 
ily concerned with coarse-grain assignment of regions to 
processors, generating a rough outline of what the final 
program will look like. The region-level scheduler assigns 
each region to a processor; determines the source, des- 
tinations, and approximate time of transmission of each 
inter-processor message; and determines the preferred 
order of execution of the regions assigned to each pro- 
cessor. The region-level scheduler takes into account the 
latency of numerical operations, the inter-processor com- 
munication capabilities of the target architecture, the 
structure (critical path) of the computation, and which 
data values each processor will store in its memory. How- 
ever, the region-level scheduler does not concern itself 
with finer-grain details such as the pipeline structure of 
the processors, the detailed allocation of each communi- 
cation channel, or the ordering of individual operations 
within a processor. At the coarse grain-size associated
with the scheduling of regions, a straightforward set of
critical-path based scheduling heuristics 13 has proven
quite effective. For the 9-body problem example, the
computational load was spread so evenly that the
variation in utilization efficiency among the 8 processors
was only 1%.

Figure 6: Parallelism profile of the 9-body problem after
operations have been grouped together to form regions.
Comparison with Figure 3 clearly shows that increasing
the grain-size significantly reduced the opportunities for
parallel execution. In particular, the maximum speedup
factor dropped from 98 times faster to only 49 times
faster than a single processor.

13 The heuristics used by the region-level scheduler are
closely related to list-scheduling [8]. A detailed discussion
of the heuristics used by the region-level scheduler is
presented in [22].
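For readers unfamiliar with list scheduling, the following sketch shows a generic critical-path list scheduler that assigns regions to processors while charging a fixed latency for values that cross processors. It is a textbook-style illustration, not the heuristics of [22]: it ignores bandwidth limits and memory placement, and the function name, graph encoding, and example latencies are assumptions. The six-cycle communication latency in the example is the Toolkit figure quoted in the introduction.

```python
def list_schedule(latency, deps, n_procs, comm_latency):
    """Greedy critical-path list scheduling of regions onto processors.
    latency: {region: cycles}, in topological order (all cycles >= 1).
    deps: {region: [producer regions]}.
    Returns ({region: (processor, start, finish)}, makespan)."""
    consumers = {r: [] for r in latency}
    for r, producers in deps.items():
        for p in producers:
            consumers[p].append(r)
    # Priority: length of the longest path from a region to the end.
    rank = {}
    for r in reversed(list(latency)):
        rank[r] = latency[r] + max((rank[c] for c in consumers[r]), default=0)

    proc_free = [0] * n_procs  # cycle at which each processor becomes idle
    placed = {}
    # A producer's rank always exceeds that of its consumers, so sorting
    # by descending rank schedules producers first.
    for r in sorted(latency, key=lambda r: -rank[r]):
        best = None
        for proc in range(n_procs):
            ready = 0
            for p in deps.get(r, []):
                p_proc, _, p_finish = placed[p]
                delay = 0 if p_proc == proc else comm_latency
                ready = max(ready, p_finish + delay)
            start = max(ready, proc_free[proc])
            if best is None or start < best[1]:
                best = (proc, start)
        proc, start = best
        placed[r] = (proc, start, start + latency[r])
        proc_free[proc] = start + latency[r]
    makespan = max(finish for _, _, finish in placed.values())
    return placed, makespan


# Two independent chains of 8-cycle regions joined by a final region.
latency = {"A1": 8, "B1": 8, "A2": 8, "B2": 8, "F": 1}
deps = {"A2": ["A1"], "B2": ["B1"], "F": ["A2", "B2"]}
placed2, makespan2 = list_schedule(latency, deps, n_procs=2, comm_latency=6)
placed1, makespan1 = list_schedule(latency, deps, n_procs=1, comm_latency=6)
print(makespan1, makespan2)
```

In this example each chain stays on one processor, and only the final region waits for a transmitted value, so the six-cycle latency is paid once.
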

The final phase of the compilation process is 
instruction-level scheduling. The region-level scheduler 
provides the instruction-level scheduler with a set of op- 
erations to execute on each processor, along with a set 
of preferences regarding the order in which those oper- 
ations should be computed, and a list of results that 
need to be transmitted among the processors. The 
instruction-level scheduler derives low-level pipelined instructions
for each processor, chooses the exact time and commu- 
nication channel for each inter-processor transmission, 
and determines where values will be stored within each 
processor. The instruction-level scheduler chooses the 
final ordering of the operations within each processor, 
taking into account processor pipelining, register allo- 
cation, memory access restrictions, and availability of 
interprocessor-communication channels. Whenever pos- 
sible, the order of operations is chosen so as to match the 
preferences of the region-level scheduling phase. How- 
ever, the instruction-level scheduler is free to reorder op- 
erations as needed, intertwining operations without re- 
gard to which coarse-grain region they were originally a 
member of. 

The instruction-level scheduler begins by performing a 
data-use analysis to determine which instructions share 
data values and should therefore be placed near each 
other for register allocation purposes. The scheduler 
combines the data-use information with the instruction- 
ordering preferences provided by the region-level sched- 
uler to produce a scheduling priority for each instruction. 
The scheduling process is performed one cycle at a time, 
performing scheduling of a cycle on all processors be- 
fore moving on to the next cycle. Instructions compete 
for resources based on their scheduling priority; in each 
cycle, the highest-priority operation whose data and pro- 
cessor resources are available will be scheduled. Due to 
this competition for data and resources, operations may 
be scheduled out of order if their resources happen to 
be available, in order to keep the processor busy. In- 
deed, when the performance of the instruction-scheduler 
is measured independently of the region-scheduler, by 
generating code for a single VLIW processor, utilization 
efficiencies in excess of 99.7% are routinely achieved, rep- 
resenting nearly optimal code. 

An aspect of the scheduler that has proven to be
particularly important is the retroactive scheduling of
memory references. Although computation instructions
(such as + or *) are scheduled on a cycle-by-cycle basis,
memory LOAD instructions are scheduled retroactively,
wherever they happen to fit in. For instance, when a
computation instruction requires that a value be loaded
into a register from memory, the actual memory access
operation 14 is scheduled in the past for the earliest
moment at which both a register and a memory-bus cycle
are available; the memory operation may occur 50 or
even 100 instructions earlier than the computation
instruction. Since memory operations on the Supercomputer
Toolkit must compete for bus access with inter-processor
messages, retroactive scheduling of memory references
helps to avoid interference between memory and
communication traffic.

14 On the Toolkit architecture, two memory operations may
occur in parallel with computation and address-generation
operations. This ensures that retroactively scheduled memory
accesses will not interfere with computations from previous
cycles that have already been scheduled.
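The retroactive placement of a LOAD can be sketched as a search backward in the already-built schedule. The model below is deliberately simplified and hypothetical: it assumes a single memory-bus slot per cycle (the Toolkit permits two memory operations per cycle) and tracks register pressure as a per-cycle count. The function and parameter names are illustrative only.

```python
def schedule_load_retroactively(use_cycle, bus_busy, regs_in_use, n_regs):
    """Place a LOAD for a value consumed at `use_cycle`.
    bus_busy: set of cycles whose memory-bus slot is already taken.
    regs_in_use: list; regs_in_use[t] counts registers live in cycle t.
    Returns the chosen load cycle, or None if no earlier slot fits."""
    for t in range(use_cycle):  # try the earliest cycle first
        bus_free = t not in bus_busy
        # The loaded value occupies a register from t until it is consumed.
        reg_free = all(regs_in_use[c] < n_regs for c in range(t, use_cycle))
        if bus_free and reg_free:
            bus_busy.add(t)
            for c in range(t, use_cycle):
                regs_in_use[c] += 1
            return t
    return None


bus_busy = {0, 1}                 # bus already used by earlier traffic
regs_in_use = [2, 2, 2, 1, 1]     # register pressure in cycles 0..4
first = schedule_load_retroactively(5, bus_busy, regs_in_use, n_regs=2)
second = schedule_load_retroactively(5, bus_busy, regs_in_use, n_regs=2)
print(first, second)
```

The first load lands in cycle 3, the earliest cycle with both a free bus slot and a free register through to the consuming instruction. The second request finds no slot, which in a real scheduler would force the consuming instruction to be delayed.
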

8 Performance Measurements 

The Supercomputer Toolkit and its associated compiler 
have been used for a wide variety of applications, rang- 
ing from computation of human genetic pedigrees to the 
simulation of electrical circuits. The applications that 
have generated the most interest from the scientific com- 
munity involve various integrations of the N-body grav- 
itational attraction problem. 15 Parallelization of these 
integrations has been previously studied by Miller[15], 
who parallelized the program by using futures to man- 
ually specify how parallel execution should be attained. 
Miller shows how one can re-write the N-body program 
so as to eliminate sequential data structure accesses to 
provide more effective parallel execution, manually per- 
forming some of the optimizations that partial evaluation 
provides automatically. Others have developed special- 
purpose hardware that parallelizes the 9-body problem 
by dedicating one processor to each planet [2]. Previous
work in partial evaluation [3, 4, 5] has shown that the 
9-body problem contains large amounts of fine-grain par- 
allelism, making it plausible that more subtle paralleliza- 
tions are possible without the need to dedicate one pro- 
cessor to each planet. 

We have measured the effectiveness of coupling partial 
evaluation with grain-size adjustment to generate code 
for the Supercomputer Toolkit parallel computer, an ar- 
chitecture that suffers from serious interprocessor com- 
munication latency and bandwidth limitations. Figure 7 
shows the parallel speedups achieved by our compiler for 
several different N-body interaction applications. Fig- 
ure 9 focuses on the 9-body program (ST9) discussed ear- 
lier in this paper, illustrating how the parallel speedup 
varies with the number of processors used. Note that 
as the number of processors increases beyond 10, the 
speedup curves level off. A more detailed analysis has 
revealed that this is due to the saturation of the inter- 
processor communication pathways, as illustrated in Fig- 
ure 10. 
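As a quick check on what these figures imply, the speedup definition used in Figure 7 (single-processor time divided by multiprocessor time) can be combined with the processor count to give a parallel efficiency. The helper names below are hypothetical, and the only measured number used is the 6.2 speedup on eight processors reported in the introduction.

```python
def speedup(t_single, t_parallel):
    """Single-processor execution time divided by multiprocessor time."""
    return t_single / t_parallel


def efficiency(speedup_factor, n_procs):
    """Fraction of the ideal linear speedup that was achieved."""
    return speedup_factor / n_procs


eff = efficiency(6.2, 8)
print(f"{eff:.1%}")
```

A speedup of 6.2 on eight processors corresponds to an efficiency of roughly 77.5 percent.
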

9 Related Work 

The use of partial evaluation to expose parallelism makes 
our approach to parallel compilation fundamentally dif- 
ferent from the approaches taken by other compilers. 
Traditionally, compilers have maintained the data struc- 
tures and control structure of the original program. For 
example, if the original program represented an object 
as a doubly-linked list of numbers, the compiled program 
would as well. Only through partial evaluation can the
data structures used by the programmer to think about
the problem be removed, leaving the compiler free to
optimize the underlying numerical computation,
unhindered by sequentially-accessed data structures and
procedure calls.

Footnote 15: For instance, [23] describes results
obtained using the Supercomputer Toolkit that prove that
the solar system is chaotic.

Figure 7: Speedups of various applications running on 8
processors. Four different computations have been
compiled in order to measure the performance of the
compiler: a 6-particle Stormer integration (ST6), a
9-particle Stormer integration (ST9), a 12-particle
Stormer integration (ST12), and a 9-particle
fourth-order Runge-Kutta integration. Speedup is the
single-processor execution time of the computation
divided by the total execution time on the
multiprocessor.

Figure 8: The result of scheduling the 9-body problem
onto 8 Supercomputer Toolkit processors. Comparison
with the region-level parallelism profile (Figure 6)
illustrates how the scheduler spread the coarse-grain
parallelism across the processors. A total of 340 cycles
are required to complete the computation. On average,
6.5 of the 8 processors are utilized during each cycle.

Figure 9: Speedup graph of Stormer integrations. Ample
speedups are available to keep the 8-processor
Supercomputer Toolkit busy. However, the incremental
improvement from using more than 10 processors is
relatively small.

Figure 10: Utilization of the inter-processor
communication pathways. The communication system becomes
saturated at around 10 processors. This accounts for
the lack of incremental improvement available from
using more than 10 processors that was seen in Figure 9.

Many compilers for high-performance architectures
use program transformations to exploit low-level
parallelism. For instance, compilers for vector machines
unroll loops to help fill vector registers [18]. Other
parallelization techniques include trace scheduling,
software pipelining, vectorizing, and static and dynamic
scheduling of data-flow graphs.

9.1 Trace Scheduling 

Compilers that exploit fine-grain parallelism often em- 
ploy trace-scheduling techniques [9] to guess which way a 
branch will go, allowing computations beyond the branch 
to occur in parallel with those that precede the branch. 
Our approach differs in that we use partial evaluation to 
take advantage of information about the specific applica- 
tion at hand, allowing us to totally eliminate many data- 
independent branches, producing basic blocks on the 
order of several thousands of instructions, rather than 
the 10-30 instructions typically encountered by trace- 
scheduling based compilers. An interesting direction for 
future work would be to add trace-scheduling to our ap- 
proach, to optimize across the data-dependent branches 
that occur at basic block boundaries. 

Most trace-scheduling based compilers use a variant
of list-scheduling [8] to parallelize operations within
an individual basic block. Although list-scheduling using
critical-path based heuristics is very effective when the
grain-size of the instructions is well-matched to the
inter-processor communication bandwidth, we have found
that in the case of limited bandwidth, a grain-size
adjustment phase is required to make the list-scheduling
approach effective.
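
As a rough illustration of list-scheduling with a
critical-path heuristic, the following Python sketch
schedules a small dependence graph onto a fixed number of
processors. It assumes unit-latency operations and free
communication, which is exactly the assumption that
fails on a bandwidth-limited machine. The function names
and the example graph are hypothetical and are not taken
from the Toolkit compiler.

```python
from typing import Dict, List


def critical_path_lengths(succs: Dict[str, List[str]]) -> Dict[str, int]:
    """Length (in operations) of the longest path from each node to a sink."""
    memo: Dict[str, int] = {}

    def longest(node: str) -> int:
        if node not in memo:
            memo[node] = 1 + max((longest(s) for s in succs[node]), default=0)
        return memo[node]

    for node in succs:
        longest(node)
    return memo


def list_schedule(succs: Dict[str, List[str]], n_procs: int) -> List[List[str]]:
    """Return one list of operations per cycle (unit latency, free communication)."""
    priority = critical_path_lengths(succs)
    unmet = {n: 0 for n in succs}          # number of unscheduled predecessors
    for n in succs:
        for s in succs[n]:
            unmet[s] += 1
    ready = [n for n in succs if unmet[n] == 0]
    schedule: List[List[str]] = []
    while ready:
        # Longest remaining path first; the name breaks ties deterministically.
        ready.sort(key=lambda n: (-priority[n], n))
        cycle, ready = ready[:n_procs], ready[n_procs:]
        for n in cycle:
            for s in succs[n]:
                unmet[s] -= 1
                if unmet[s] == 0:
                    ready.append(s)   # becomes eligible in the next cycle
        schedule.append(cycle)
    return schedule


# Hypothetical graph: a -> c -> e, b -> c, d -> e
graph = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
schedule = list_schedule(graph, n_procs=2)
```

With two processors this graph is scheduled in three
cycles, which equals the length of its critical path
(a, c, e). Charging a cost for every value that crosses
processors would change the picture considerably, which
is the motivation for a grain-size adjustment phase.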

9.2 Software Pipelining 

Software pipelining [11] optimizes a particular
fixed-size loop structure such that several iterations
of the loop are started on different processors at
constant intervals of time. This increases the throughput
of the computation. The effectiveness of software
pipelining will be determined by whether the grain-size
of the parallelism expressed in the looping structure
employed by the programmer matches the architecture:
software pipelining cannot parallelize a computation that
has its parallelism hidden behind inherently sequential
data references and spread across multiple loops. The
partial-evaluation approach on such a loop structure
would result in the loop being completely unrolled, with
all of the sequential data structure references removed
and all of the fine-grain parallelism in the loop's
computation exposed and available for parallelization.
In some applications, especially those involving partial
differential equations, fully unrolling loops may
generate prohibitively large programs. In these
situations, partial evaluation could be used to optimize
the innermost loops of a computation, with techniques
such as software pipelining used to handle the outer
loops.
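
The effect of partial evaluation on a loop can be
sketched in a few lines of Python. The example below is
not the partial evaluator used by our compiler; it is a
minimal stand-in that runs an ordinary procedure on
symbolic placeholders and records the arithmetic that
would remain at run time. The names Sym, trace, and dot
are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

trace: List[Tuple[str, str, str, str]] = []   # (dest, op, left, right)


@dataclass(frozen=True)
class Sym:
    """Placeholder for a number whose value is unknown until run time."""
    name: str

    def _emit(self, op: str, other: "Sym") -> "Sym":
        dest = Sym(f"t{len(trace)}")
        trace.append((dest.name, op, self.name, other.name))
        return dest

    def __add__(self, other: "Sym") -> "Sym":
        return self._emit("+", other)

    def __mul__(self, other: "Sym") -> "Sym":
        return self._emit("*", other)


def dot(xs, ys):
    """Ordinary high-level code: a loop over two sequences."""
    total = xs[0] * ys[0]
    for x, y in zip(xs[1:], ys[1:]):
        total = total + x * y
    return total


# The vector length (3) is known at compile time; the element values are not.
xs = [Sym(f"x{i}") for i in range(3)]
ys = [Sym(f"y{i}") for i in range(3)]
result = dot(xs, ys)
```

The recorded trace contains five operations and no loop,
list access, or procedure call. The three
multiplications do not depend on one another, so a
scheduler is free to issue them in parallel.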

9.3 Vectorizing 

Vectorizing is a commonly used optimization for vector
supercomputers, executing operations on each vector
element in parallel. This technique is highly effective
provided that the computation is composed primarily of
readily identifiable vector operations (such as matrix
multiplication). Most vectorizing compilers generate
vector code from a scalar specification by recognizing
certain standard looping constructs. However, if the
source program lacks the necessary vector-accessing loop
structure, the compiled programs perform very poorly.
For computations that are mostly data-independent, the
combination of partial evaluation with static scheduling
techniques has the potential to be vastly more effective
than vectorization. Whereas a vectorizing compiler will
often fail simply because the computation's structure
does not lend itself to a vector-oriented
representation, the partial-evaluation/static scheduling
approach can often succeed by making use of very
fine-grained parallelism. On the other hand, for
computations that are highly data-dependent, or which
have a highly irregular structure that makes unrolling
loops infeasible, vectorizing remains an important
option.

9.4 Iterative Restructuring 

Iterative restructuring represents the manual approach
to parallelization. Programmers write and rewrite their
code until the parallelizer is able to automatically
recognize and utilize the available parallelism. There
are many utilities for doing this, some of which are
discussed in [7]. This approach is not flexible in that
whenever one aspect of the computation is changed, one
must ensure that parallelism in the changed computation
is fully expressed by the loop and data-reference
structure of the program.

9.5 Static Scheduling 

Static scheduling of the fine-grained parallelism
embedded in large basic blocks has also been investigated
for use on the Oscar architecture at Waseda University in
Japan [12]. The Oscar compiler uses a technique called
task fusion that is similar in spirit to the grain-size ad- 
justment technique used on the Supercomputer Toolkit. 
However, the Oscar compiler lacks a partial-evaluation 
phase, leaving it to the programmer to manually gen- 
erate large basic blocks. Although the manual creation 
of huge basic blocks (or of automated program genera- 
tors) may be practical for computations such as an FFT 
that have a very regular structure, this is not a reason- 
able alternative for more complex programs that require 
abstraction and complex data structure representations. 
For example, imagine writing out the 11,000 floating- 
point operations for the Stormer integration of the So- 
lar system and then suddenly realizing that you need 
to change to a different integration method. The man- 
ual coder would grimace, whereas a programmer writing 
code for a compiler that uses partial evaluation would 
simply alter a high-level procedure call. It appears that 
the compiler for Oscar could benefit a great deal from 
the use of partial evaluation. 
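
The general idea of fusing fine-grain operations into
coarser units can be sketched as follows. The grouping
rule used here (an operation whose result has exactly one
consumer is placed in the same group as that consumer) is
an assumption chosen for illustration. It is not claimed
to be the Oscar task-fusion algorithm, nor the exact rule
applied by the Toolkit compiler, and the function name is
hypothetical.

```python
from typing import Dict, List


def group_into_regions(consumers: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each operation to a region label (the name of the region's root)."""
    parent = {op: op for op in consumers}

    def find(op: str) -> str:
        while parent[op] != op:
            parent[op] = parent[parent[op]]   # path halving
            op = parent[op]
        return op

    for op, users in consumers.items():
        if len(users) == 1:
            # Sole consumer: keep producer and consumer on the same processor,
            # so the intermediate value never needs to be transmitted.
            parent[find(op)] = find(users[0])
    return {op: find(op) for op in consumers}


# Hypothetical graph: m1 and m2 feed only s; s feeds both u and v.
consumers = {"m1": ["s"], "m2": ["s"], "s": ["u", "v"], "u": [], "v": []}
regions = group_into_regions(consumers)
```

In this example m1, m2, and s end up in one group, so
only the value of s has to cross a processor boundary.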

10 Conclusions 

Partial evaluation has an important role to play in the 
parallel compilation process, especially for largely data- 
independent programs such as those associated with 
numerically-oriented scientific computations. Our ap- 
proach of adjusting the grain size of the computation to 
match the architecture was possible only because of par- 
tial evaluation: If we had taken the more conventional 
approach of using the structure of the program to detect 
parallelism, we would then be stuck with the grain-size 
provided us by the programmer. By breaking down the 
program structure to its finest level, and then imposing 
our own program structure (regions) based on locality 
of reference, we have the freedom to choose the grain- 
size to match the architecture. The coupling of partial 
evaluation with static scheduling techniques in the Su- 
percomputer Toolkit compiler has allowed scientists to 
write programs that reflect their way of thinking about 
a problem, eliminating the need to write programs in an 
obscure style that makes parallelism more apparent. 

A Appendix: Architecture of the
Supercomputer Toolkit

The Supercomputer Toolkit is a MIMD computer. It
consists of eight separate VLIW (Very Long Instruction
Word) processors and a configurable interconnection
network. A detailed review of the Supercomputer Toolkit
architecture may be found in [1]. Each Toolkit processor
has two bi-directional communication ports that may
be connected to form various communication topologies.
The parallelizing compiler is targeted for a
configuration in which all of the processors are
interconnected by two independent shared communication
buses. The processors operate in lock-step, synchronized
by a master clock that ensures they begin each cycle at
the same moment. Each processor has its own program
counter, allowing independent tasks to be performed by
each processor. A single "global" condition flag that
spans the 8 processors provides the option of having the
individual processors act together so as to emulate a
ULIW (ultra-long instruction word) computer.

B The Toolkit Processor 

Figure 11 shows the architecture of each processor. The
design is symmetric and is intended to provide the
memory bandwidth needed to take full advantage of
instruction-level parallelism. Each processor has a
64-bit floating-point chip set, a five-port 32x64-bit
register file, two separately addressable data memories,
two integer processors for memory address generation
(footnote 16), two I/O ports, a sequencer, and a separate
instruction memory. The processor is pipelined and is
thus capable of initiating the following instructions in
parallel during each clock cycle: a left memory-I/O
operation, a right memory-I/O operation, an FALU
operation (footnote 17), an FMUL operation (footnote 18),
and a sequencer operation (footnote 19). The compiler
takes full advantage of the architecture, scheduling
computation instructions in parallel with memory
operations or communication. The Toolkit is completely
synchronous and clocked at 12.5 MHz. When both the
FALU and FMUL are utilized, the Toolkit is capable
of a peak rate of 200 Megaflops, 25 on each board. The
compiler typically achieves approximately 1/2 of this
capability because it does not attempt to simultaneously
utilize both the FMUL and the FALU (footnote 20).

Figure 11: The overall architecture of a Supercomputer
Toolkit processor node, consisting of a fast
floating-point chip set, a 5-port register file, two
memories, two integer ALU address generators, and a
sequencer.
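
The peak-rate figures quoted for the Toolkit follow from
the clock rate and the number of floating-point units.
The short Python calculation below reproduces them,
assuming that each unit can complete one floating-point
operation per cycle; the constant names are hypothetical.

```python
CLOCK_MHZ = 12.5          # clock rate stated in the text
FP_UNITS_PER_BOARD = 2    # the FALU and the FMUL
BOARDS = 8

# Assumption: one floating-point operation per unit per cycle.
per_board_mflops = CLOCK_MHZ * FP_UNITS_PER_BOARD
peak_mflops = per_board_mflops * BOARDS
# Only one of the two units is normally kept busy by the compiler.
typical_mflops = peak_mflops / 2
```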

The compiler allocates two of the 32 registers for
communication purposes (data buffering), while 3
registers are reserved for use by the hardware itself.
Thus 26 registers are available for use by scheduled
computations. The floating-point chips have a three-stage
pipeline whereby the result of an operation initiated on
cycle N will be available in the output latch on cycle
1 + N + L, where L is the latency of the computation.
The result can then be moved into the register file
during any of the following cycles, until the next result
is moved into the output latch. There are feedback
(pipeline bypass) paths in the floating-point pipeline
that allow computed results to be fed back for use as
operands in the next cycle. The compiler takes advantage
of these feedback mechanisms to reduce register
utilization.

Footnote 16: Each memory address generator consists of
an integer processor tied closely to a local register
file.

Footnote 17: The FALU is capable of doing integer
operations, most floating-point operations, and many
other one-cycle operations. It is tagged + in Figure 11.

Footnote 18: The FMUL is capable of doing floating-point
multiplies (1 cycle latency), floating-point division
(5 cycle latency), and floating-point square roots
(9 cycle latency), as well as many other operations. It
is tagged * in Figure 11.

Footnote 19: The sequencer contains a small local memory
for handling stack operations.

Footnote 20: Simultaneous utilization of the FMUL and
FALU is only occasionally worthwhile for long
multiply-accumulate operations. Since the FMUL and FALU
share their register-file ports, opportunities for making
simultaneous use of both units are rare.
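
The timing rule for the floating-point pipeline can be
written down directly. The following Python fragment is
only a restatement of the formula 1 + N + L using the
division and square-root latencies given for the FMUL;
the function and table names are hypothetical.

```python
# Latencies in cycles for two FMUL operations, as given for the FMUL.
LATENCY = {"divide": 5, "sqrt": 9}


def result_cycle(issue_cycle: int, op: str) -> int:
    """Cycle on which the result reaches the output latch: 1 + N + L."""
    return 1 + issue_cycle + LATENCY[op]
```

For example, a division issued on cycle 10 reaches the
output latch on cycle 16, so a static scheduler must not
place a consumer of that value any earlier.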

The bus that connects the memory, I/O port, and 
register-file is a resource bottleneck, allowing either a 
memory load, a memory store, an I/O transmission, or 
an I/O reception to be scheduled during each cycle. This 
bus appears twice in the architecture, once in each of
the two independent memory-I/O subsystems.

C Interconnection Network and 
Communication 

The toolkit allows for flexible interconnection among the 
boards through its two I/O ports. The interconnec- 
tion scheme is not fixed and many configurations are 
possible, although changing the configuration requires 
manual insertion of connectors. The compiler currently 
views this network as two separate buses: a left and a 
right bus. Each processor is connected to both buses 
through its left and right I/O ports. This configuration 
was chosen as the one that would place the fewest local- 
ity restrictions on the types of programs that could be 
compiled efficiently. However, targeting the compiler for 
other configurations, such as a single shared bus on the 
left side, with pairwise connections between processors 
on the right side, may prove advantageous for certain 
applications. Each transmission requires two cycles to
complete, so the two buses together can carry at most
one value per cycle. Thus, in the two-shared-bus
8-processor configuration, only one out of every eight
results may be transmitted. Pipeline latencies introduce
a six-cycle delay between the time that a value is
produced on one processor and the time that it is
available for use by the floating-point unit of another
processor.
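
The one-in-eight figure can be checked with a short
calculation. The Python sketch below assumes that each
processor produces at most one result per cycle; the
constant names are hypothetical.

```python
from fractions import Fraction

BUSES = 2
CYCLES_PER_TRANSMISSION = 2
PROCESSORS = 8

# Values that the two buses can carry per cycle, machine-wide.
transmissions_per_cycle = Fraction(BUSES, CYCLES_PER_TRANSMISSION)
# Assumption: each processor produces at most one result per cycle.
results_per_cycle = PROCESSORS
transmittable_fraction = transmissions_per_cycle / results_per_cycle
```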

The hardware permits any processor to transmit a 
value at any time, relying on software to allocate the 
communication channels to a particular processor for any 
given cycle. Once a value is transmitted, each receiving 
processor must explicitly issue a "receive" instruction 
one cycle after the transmission occurred. The compiler 
allocates the communication pathways on a cycle by cy- 
cle basis, automatically generating the appropriate send 
and receive instructions. 
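
Cycle-by-cycle allocation of the two buses can be
pictured as a reservation table. The Python sketch below
is an illustrative model rather than the compiler's
actual allocator: the class and function names are
hypothetical, and it assumes that the receive
instruction is issued one cycle after the cycle on which
the send is issued.

```python
from typing import Dict, List, Set, Tuple

CYCLES_PER_TRANSMISSION = 2   # from the text


class BusAllocator:
    """Cycle-by-cycle reservation table for the left and right buses."""

    def __init__(self) -> None:
        self.busy: Dict[str, Set[int]] = {"left": set(), "right": set()}

    def reserve(self, earliest: int) -> Tuple[str, int]:
        """Reserve the first free two-cycle slot starting at or after `earliest`."""
        cycle = earliest
        while True:
            for bus, taken in self.busy.items():
                slot = range(cycle, cycle + CYCLES_PER_TRANSMISSION)
                if not any(c in taken for c in slot):
                    taken.update(slot)
                    return bus, cycle
            cycle += 1


def emit_transfer(alloc: BusAllocator, sender: int, receivers: List[int],
                  ready: int) -> List[Tuple[int, int, str]]:
    """Return (cycle, processor, instruction) triples for one transmitted value."""
    bus, cycle = alloc.reserve(ready)
    code = [(cycle, sender, f"send {bus}")]
    # Assumption: "one cycle after" is measured from the cycle of the send.
    code += [(cycle + 1, r, f"receive {bus}") for r in receivers]
    return code


alloc = BusAllocator()
first = emit_transfer(alloc, sender=0, receivers=[1, 2], ready=5)
second = emit_transfer(alloc, sender=3, receivers=[4], ready=5)
third = emit_transfer(alloc, sender=6, receivers=[7], ready=5)
```

The first two values are sent on cycle 5, one on each
bus. The third request finds both buses occupied for
cycles 5 and 6 and is deferred to cycle 7.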